Construction Of Corpus-Based Syntactic Rules For Accurate Speech Recognition
Abstract
This paper describes the syntactic rules which are applied in the Japanese speech recognition module of a speech-to-speech translation system. Japanese is considered to be a free word/phrase order language. Since syntactic rules are applied as constraints to reduce the search space in speech recognition, applying rules which take into account all possible phrase orders can have almost the same effect as using no constraints. Instead, we take into consideration the recognition weaknesses of certain syntactic categories and treat them precisely, so that a minimal number of rules can work most effectively. In this paper we first examine which syntactic categories are easily misrecognized. Second, we consult our dialogue corpus, in order to provide the rules with great generality. Based on both studies, we refine the rules. Finally, we verify the validity of the refinement through speech recognition experiments.

1 Introduction

We are developing the Spoken Language TRANSlation system (SL-TRANS)[1], in which both speech recognition processing and natural language processing are integrated. Currently we are studying automatic speech translation from Japanese into English in the domain of dialogues with the reception service of an international conference office. In this framework we are constructing syntactic rules for recognition of Japanese speech. In speech recognition, the most significant concern is raising the recognition accuracy. For that purpose, applying linguistic information turns out to be promising. Various approaches have been taken, such as using stochastic models[2], syntactic rules[3], semantic information[4] and discourse plans[5]. Among stochastic models, the bigram and trigram succeeded in achieving a high recognition accuracy in languages that have a strong tendency toward a standard word order, such as English. On the contrary, Japanese belongs to the free word order languages[6].
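The word-order contrast can be made concrete. A bigram model scores a sentence by the product of adjacent-word conditional probabilities, which is informative only when word order is fairly rigid; under free phrase order the probability mass spreads over many orderings of the same words. A minimal unsmoothed sketch (the toy training sentences are illustrative assumptions, not the models of [2]):

```python
from collections import Counter

def train_bigram(sentences):
    """Count unigrams and bigrams over whitespace-tokenized sentences."""
    uni, bi = Counter(), Counter()
    for s in sentences:
        toks = ["<s>"] + s.split() + ["</s>"]
        uni.update(toks)
        bi.update(zip(toks, toks[1:]))
    return uni, bi

def bigram_prob(uni, bi, sentence):
    """P(sentence) as a product of conditional bigram probabilities
    (unsmoothed; an unseen bigram drives the score to zero)."""
    toks = ["<s>"] + sentence.split() + ["</s>"]
    p = 1.0
    for a, b in zip(toks, toks[1:]):
        if uni[a] == 0:
            return 0.0
        p *= bi[(a, b)] / uni[a]
    return p

uni, bi = train_bigram(["i would like to register",
                        "i would like to attend"])
# The attested order scores higher than a permutation of the same words.
print(bigram_prob(uni, bi, "i would like to register") >
      bigram_prob(uni, bi, "register to like would i"))
```

In a language where many permutations of the same phrases are grammatical, all of them would need non-negligible probability, which is why such adjacency statistics constrain Japanese less than English.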
For such a language, semantic information is more adequate as a constraint. However, building semantic constraints for a large vocabulary needs a tremendous amount of data. Currently, our data consist of dialogues between the conference registration office and prospective conference participants, with approximately 199,000 words in telephone conversations and approximately 72,000 words in keyboard conversations. But our data are still not sufficient to build appropriate semantic constraints for sentences with 700 distinct words. Processing a discourse plan requires excessive calculation, and the study of discourse itself must be further developed to be applicable to speech recognition. On the other hand, syntax has been studied in more detail and makes increasing the vocabulary easier. As we are working on spoken language, we try to reflect real language usage. For this purpose, a stochastic approach beyond trigrams, namely stochastic sentence parsing[7], seems most promising. Ideally, syntactic rules should be generated automatically from a large dialogue corpus and probabilities should also be automatically assigned to each node. But to do so, we need underlying rules. Moreover, coping with phoneme perplexity, which is crucial to speech recognition, with rules created from a dialogue corpus requires additional research[8]. In this paper we propose taking into account the weaknesses of the speech recognition system in the earliest stage, namely when we construct underlying syntactic rules. First, we examined the speech recognition results to determine which syntactic categories tend to be recognized erroneously. Second, we utilized our dialogue corpus[9] to support the refinement of rules concerning those categories. As examples, we discuss formal nouns¹ and conjunctive postpositions². Finally, we carried out a speech recognition experiment with the refined rules to verify the validity of our approach.

¹ Formal nouns: keishiki-meishi in Japanese.
² Conjunctive postpositions: setsuzoku-joshi in Japanese.

Actes de COLING-92, Nantes, 23-28 août 1992 — Proc. of COLING-92, Nantes, Aug. 23-28, 1992

2 Issues in HMM-LR Speech Recognition

In the Japanese speech recognition module of our experimental system the combination of generalized LR parsing and Hidden Markov Model (HMM) is realized as HMM-LR [10]. The system predicts phonemes by using an LR parsing table and drives HMM phoneme verifiers to detect/verify them without any intervening structure, such as a phoneme lattice. The speech recognition unit is a Japanese bunsetsu, which roughly corresponds to a phrase and is the next largest unit after the word. The ending of the bunsetsu (phrase) is usually marked by a breath point. This justifies its treatment as a distinct unit. A Japanese phrase consists of one independent word (e.g. noun, adverb, verb) and zero, one or more than one dependent words (e.g. postposition, auxiliary verb). The number of words in a phrase ranges from 1 to 14, and the mean number is about 3, according to our dialogue corpus. We will clarify the weaknesses of HMM-LR speech recognition both in phrases and in sentences.

2.1 Phrase Recognition Errors

We examined which syntactic categories tend to be erroneously recognized, when using HMM-LR phrase speech recognition. For this purpose, we applied syntactic rules containing no constraints on word sequences³. This means that any word can follow any word. Examples (1) and (2) show the results of HMM-LR Japanese speech recognition⁴. The uttered phoneme strings are enclosed in | |.

(1) |sochirawa| (this, that)
> 1 : sochira-wa
  2 : sochira-wa-hu
  3 : sochira-ha-wa
  4 : sochira-hu-wa-hu
  5 : sochira-wa-hu-hu

(2) |aringatougozaimasu| (thank you)
  1 : ari-nga-to-wa-eN-hu-su-su-su
  2 : ari-nga-to-wa-eN-hu-su-su
  3 : ari-nga-to-wa-eN-hu-su-su-u
  4 : ari-nga-to-wa-eN-su-su
  5 : ari-nga-to-wa-eN-hu-su-su-su-a

³ Japanese verbs, adjectives, etc. are always inflected when used. In syntactic rules containing no word sequence constraints, inflected verbs, inflected adjectives, etc. are considered to be words.
⁴ The maximal amount of the whole beam width, the global beam width, is set to 16 and the maximal beam width of each branch, the local beam width, to 10.

In the examples, the symbols >, -, ng and N have special meaning:
• A correctly recognized phrase is marked with >.
• A word boundary is marked with -.
• A nasalized /g/ is transcribed ng.
• A syllabic nasal is transcribed N.

In (1), after recognizing the first word, the system selected subsequent words solely to produce a phoneme string similar to the original utterance. (2) is an example of phrase recognition which failed. In this example tou was erroneously recognized as to. Subsequently, no further correct words were selected. Examples (1) and (2) both show that HMM-LR tends to select words consisting of extremely few phonemes when it fails in word recognition. To avoid this problem, precise rules should be written for sequences of words with small numbers of phonemes. In Japanese, postpositions (e.g. ga, o, ni), wh-pronouns (e.g. itsu, nani, dare)[11], numerals (e.g. ichi, ni, san) and certain nouns (e.g. kata, mono) particularly fit this description.

2.2 Sentence Recognition Errors

To examine the error tendency of sentence speech recognition we applied a two-step method[12]. First, we applied phrase rules to the HMM-LR speech recognition⁵. Second, we applied phrase-based sentence rules to the phrase candidates as a post-filter, in order to obtain sentence candidates, while filtering out unacceptable candidates. We experimented with the 353 phrases making up 137 sentences.
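The two-step method amounts to enumerating combinations of the top-N candidates of each phrase slot and keeping only those a sentence-rule predicate accepts. The candidate strings and the toy rejection rule below are illustrative assumptions; the real system applies full phrase-based sentence rules:

```python
from itertools import product

def sentence_candidates(phrase_candidates, accept):
    """Combine the candidates of each phrase slot into sentence
    hypotheses, keeping only those the sentence rules accept."""
    out = []
    for combo in product(*phrase_candidates):
        sent = " ".join(combo)
        if accept(sent):
            out.append(sent)
    return out

# Toy sentence rule: reject hypotheses ending in the conj-pp "shi",
# which cannot close a sentence (cf. candidate 6a below).
def accept(sent):
    return not sent.endswith("-shi")

cands = [["kochira-wa", "kata-wa"],
         ["kaingizimukyoku-desu", "kaingizimukyoku-desu-shi"]]
print(sentence_candidates(cands, accept))
```

With top-5 candidates per phrase, a sentence of k phrases yields up to 5^k combinations, which is why loosely constrained rules that accept 80% of combinations still leave a large hypothesis space.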
The recognition rate for the top candidates was 68.3% by exact string matching, and for the top 5 candidates 95.5%. Based on the top 5 phrase candidates, we conducted a sentence experiment. In this experiment we applied loosely constrained sentence rules. With these rules, approximately 80% of all the possible combinations of phrase candidates were accepted. Following are examples which did not exactly match the uttered sentences⁶. Notice that misrecognized words consist of a relatively small number of phonemes, as we have seen in section 2.1.

(3) |kaingi-ni moushiko-mi-tai-no-desu-nga|
    (I would like to register for the conference.)
3a: kaingi-ni moushiko-mi-tai-N-desu-nga
3b: kaingi-ni moushiko-mi-gai-no-desu-ka

(4) |kochira-wa kaingizimukyoku-desu|
    (This is the conference office.)
4a: kata-wa kaingizimukyoku-desu

(5) |doumo aringatougozaimashita|
    (Thank you very much.)
5a: go-o aringatougozaimashita
5b: go-me aringatougozaimashita
5c: mono aringatougozaimashita

(6) |gozyuusho-to onamae-o onengai-shi-masu|
    (Can I have your name and address?)

⁵ The global beam width is set to 100 and the local beam width to 10.
⁶ Since the phrase candidates are obtained by the HMM-LR speech recognition, word boundaries are already marked by -.
6a: gozyuusho-to onamae-o onengai-shi-masu-shi

Though the phoneme string in 3a is different from the uttered phoneme string, the difference between no and N in meaning is minor, and has no effect on translation with the current technique. While (3) is affirmative, 3b is interrogative, which is indicated by the sentence-final postposition ka. This cannot be treated with sentence rules. To handle this problem, we need dialogue management. The uttered phrase kochira-wa in (4), meaning "this," was erroneously recognized as kata-wa in 4a, meaning "person." The word kata belongs to the formal noun group, a kind of noun which should be modified by a verbal phrase [13]. Sentence 4a is acceptable, if modified by a verbal phrase, as in 4a':

4a': midori-no seihuku-o kiteiru kata-wa kaigizimukyoku-desu
     (The person who is wearing a green uniform is [with] the conference office.)

This is also true of the phrase mono in 5c meaning "thing," which was erroneously recognized instead of doumo meaning "very much":

5c': kouka-na mono aringatou-gozaimashita
     (Thank you for the expensive thing.)

In sentence candidates 5a and 5b, the numeral go, meaning "five," is used. These sentences may seem strange at first glance, but in a situation such as playing cards, these sentences are quite natural. If someone plays a 5 when you need one, you would say: "Thanks for the five." Similarly, when you need a 3 and a 5, and someone plays a 3 and after that someone else plays a 5, you would say: "Thanks for the five, too." In the sentence candidate 6a, the conjunctive postposition (conj-pp) shi is used sentence-finally. In principle, a conj-pp combines two sentences, functioning like a conjunction, such as "while" and "though," and is used in the middle of a sentence. Erroneous sentence recognition such as in the case of 3a-b cannot be treated by sentence rules.
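One naive constraint suggested by candidates 4a and 5c would require a verbal modifier immediately before each formal noun. The sketch below, over hypothetical (word, POS) pairs with an invented tag set, is only illustrative: as the corpus study in section 3.1 shows, formal nouns also occur bare, so refined rules cannot simply ban unmodified uses.

```python
FORMAL_NOUNS = {"kata", "mono", "koto", "hou"}

def formal_nouns_licensed(tagged):
    """tagged: list of (word, pos) pairs for one sentence hypothesis.
    Require that each formal noun be immediately preceded by a verbal
    modifier (verb or adjective), rejecting bare uses like 4a."""
    for i, (word, pos) in enumerate(tagged):
        if word in FORMAL_NOUNS:
            if i == 0 or tagged[i - 1][1] not in {"verb", "adj"}:
                return False
    return True

# 4a: bare sentence-initial "kata", no verbal modifier -> rejected
print(formal_nouns_licensed([("kata", "noun"), ("wa", "pp"),
                             ("kaingizimukyoku", "noun"), ("desu", "aux")]))
# 4a': "kiteiru kata" ("the person who is wearing ...") -> accepted
print(formal_nouns_licensed([("kiteiru", "verb"), ("kata", "noun"),
                             ("wa", "pp"), ("desu", "aux")]))
```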
Therefore, we are trying to cope with erroneous recognition, as seen in sentence candidates 4a, 5a-c and 6a, with sentence rules.

3 Dealing with Speech Recognition Errors

We are going to deal with sentences containing the following phrases:
• Phrases with formal nouns
• Phrases with numerals
• Phrases with conj-pps used in the sentence-final position

In order to decide how to cope with the above problems, we used our dialogue corpus. Currently we have 177 keyboard conversations consisting of approximately 72,000 words and 181 telephone conversations consisting of approximately 199,000 words⁷. We regard keyboard conversations as representing written Japanese and telephone conversations as representing spoken Japanese. When retrieving the dialogue corpus, we always compare written and spoken Japanese, in order to clarify the features of the latter. We examined the actual usage of formal nouns as well as that of conj-pps.

3.1 Formal Nouns

We examined the behavior of formal nouns, such as koto and mono. Formal nouns are considered to be a kind of noun which lacks the content usually found in common nouns such as "sky" or "apple." They function similarly to relative pronouns and therefore are used with a verbal modifier[13], as in examples 7 and 8:

7 : kinou itta koto-wa torikeshitai.
    (I would like to take back what I said yesterday.)
8 : nedan-ga takai mono-ga shitsu-ga ii wakedewanai.
    (It is not always true that an expensive thing has good quality.)

In examples 7 and 8, the formal nouns, koto and mono, are modified by kinou itta (yesterday said) and nedan-ga takai (price expensive), respectively. But it is also true that these nouns behave like common nouns and can be used without any verbal modifier, as in examples 9 and 10:

9 : sore-wa koto desu ne.

⁷ The dialogue corpus is growing constantly. When we retrieved formal nouns, we had 113 keyboard conversations and 96 telephone conversations.
    (It is a grave matter.)
10 : mono-wa tashika-da.
    (This stuff is trustworthy.)

Considering the examples 7-10, we could define two kinds of usage for formal nouns. This distinction is applicable to sentence analysis, but is meaningless from the standpoint of applying syntactic rules as constraints.

3.1.1 Formal Nouns in the Corpus

In our dialogue corpus, koto, mono, hou and kata are the most frequently used formal nouns. Table 1 shows how often the formal nouns are used with a verbal modifier. We have also retrieved formal nouns used in the sentence-initial position, as in example 10.

Table 1: Formal Nouns (keyboard vs. telephone conversations; rows: with verbal modifier, without verbal modifier, sentence-initial, total) [table values truncated in the source]
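Counts of the kind reported in Table 1 can be gathered in one pass over a part-of-speech-tagged corpus. The tag names and the miniature three-sentence corpus below are illustrative assumptions, not the paper's data:

```python
from collections import defaultdict

FORMAL_NOUNS = {"koto", "mono", "hou", "kata"}

def tally_formal_nouns(sentences):
    """sentences: lists of (word, pos) pairs. Per formal noun, count
    uses with a preceding verbal modifier, bare uses, and
    sentence-initial uses (as in example 10)."""
    counts = defaultdict(lambda: {"with_mod": 0, "without_mod": 0, "initial": 0})
    for tagged in sentences:
        for i, (word, pos) in enumerate(tagged):
            if word not in FORMAL_NOUNS:
                continue
            if i == 0:
                counts[word]["initial"] += 1
            elif tagged[i - 1][1] in {"verb", "adj"}:
                counts[word]["with_mod"] += 1
            else:
                counts[word]["without_mod"] += 1
    return counts

corpus = [[("itta", "verb"), ("koto", "noun"), ("wa", "pp")],   # with modifier, cf. ex. 7
          [("sore", "pron"), ("wa", "pp"), ("koto", "noun")],   # bare use, cf. ex. 9
          [("mono", "noun"), ("wa", "pp"), ("tashika", "adj")]] # sentence-initial, cf. ex. 10
t = tally_formal_nouns(corpus)
print(t["koto"], t["mono"])
```

Running the same tally separately over the keyboard and telephone conversations gives the written-vs-spoken comparison the corpus study relies on.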